[SPARK-8124] [SPARKR] [WIP] Created more examples on SparkR DataFrames #6668
Conversation
Here are more examples on SparkR DataFrames including creating a SQL context, loading data and simple data manipulation
@shivaram Here is the new submission. I would like to submit a few more examples on statistical modeling and machine learning on SparkR DataFrames.
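For reference, the workflow the examples cover can be sketched with the 1.4-era SparkR API discussed in this thread. This is a minimal sketch, not code from the PR: the app name is a placeholder, and it assumes SparkR is on the library path (e.g. via `$SPARK_HOME/R/lib`) with a local Spark installation available.

```r
# Minimal 1.4-era SparkR setup (sketch; appName is a placeholder)
library(SparkR)

# Create a Spark context, then a SQL context from it
sc <- sparkR.init(appName = "SparkR-DataFrame-example")
sqlContext <- sparkRSQL.init(sc)

# ... create and manipulate DataFrames here ...

sparkR.stop()
```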
We need to have the Apache License at the top of every file. You can see https://github.com/apache/spark/blob/master/examples/src/main/r/dataframe.R#L1 for an example.
Also, per our style guide, we don't put author names / dates in the file itself, as these are tracked in the commit log.
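For context, the standard ASF license header used at the top of Spark's R example files (as in the dataframe.R file linked above) is a comment block of this form:

```r
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
```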
@shivaram I have added the Apache license at the top of every file and removed the author name & date.
This comment should probably be 'Load SparkR library into your R session'
Now using sqlContext as the variable name
This should be describe and not Describe?
Provided two options for creating DataFrames. Option 1: from local data frames; Option 2: directly create DataFrames using the read.df function
Deleted the source() function and combined all the code into one file
Deleted the getting started file and combined all the code into one file
Renamed file to data-manipulation.R
@shivaram I wanted to provide two options for creating DataFrames: one where R users can convert their local data frames into DataFrames, and the second using read.df().
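The two options being discussed can be sketched as follows under the 1.4-era SparkR API. This is illustrative, not the PR's code: it assumes an existing `sqlContext`, the sample data frame is made up, and the CSV path and `header` option are placeholders (Option 2 also needs the spark-csv package on the classpath).

```r
# Option 1: convert a local R data frame into a SparkR DataFrame
localDF <- data.frame(name = c("John", "Smith"), age = c(19, 23))
df1 <- createDataFrame(sqlContext, localDF)

# Option 2: read a file directly into a DataFrame with read.df
# (path and options are illustrative)
df2 <- read.df(sqlContext, "flights.csv",
               source = "com.databricks.spark.csv", header = "true")

printSchema(df1)
```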
Would read.csv, which is part of base R, also work for this? I know that data.table is more efficient, but I would like to avoid installing new packages in the example.
Replaced the data.table function (fread) with base R function for reading csv files (read.csv)
@shivaram Yes, the base R function works. I have changed it.
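For illustration, base R's read.csv handles this with no extra package install. The inline CSV data here is made up for the sketch; the example reads from a string via read.table's `text` argument rather than a file.

```r
# Base R's read.csv replaces data.table::fread -- no extra install needed.
csvText <- "year,month,day,carrier
2013,1,1,UA
2013,1,2,AA"
flights <- read.csv(text = csvText, stringsAsFactors = FALSE)

nrow(flights)      # 2
flights$carrier    # "UA" "AA"
```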
Could we take this in as a command line argument? I think something like
args <- commandArgs(trailing = TRUE)
if (length(args) != 1) {
  print("Usage: data-manipulation.R <path-to-flights.csv>")
  print("The data can be downloaded from: https://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv")
  q("no")
}
flightsCsvPath <- args[[1]]
should do the trick
Taking in data set as a command line argument
@shivaram I fixed that. You will notice that read.csv() does not work well with SSL (that is, https connections), so I changed the connection to http.
This should be sparkRSQL and not SparkRSQL
So I tried to run this locally and this step is very slow for the dataset we are using here (I filed https://issues.apache.org/jira/browse/SPARK-8277) due to the way we convert local data frames to lists.
I see two options here: (1) use fewer rows in the example file, so that this runs fast, or (2) use a different dataset to demonstrate creating a SparkR DataFrame from a local data frame (the CSV reader is fine).
Let me know which you think is better.
To create a SparkR DataFrame, I used fewer rows of the local data frame.
@shivaram To create a Spark DataFrame from a local data frame, I used a subset of the data with fewer rows.
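The workaround might look like this. It is a sketch only: it assumes `flights` was read with read.csv and that a `sqlContext` exists, and the row count of 1000 is an arbitrary choice, not a value from the PR.

```r
# createDataFrame was slow on the full local data frame (SPARK-8277),
# so convert only a small subset; 1000 rows is an arbitrary choice.
flightsSmall <- head(flights, 1000)
flightsDF <- createDataFrame(sqlContext, flightsSmall)
```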
This line should also go inside the if block
Jenkins, ok to test
The source here needs to be com.databricks.spark.csv
BTW @rxin is there some way we can map source = csv to that automatically ?
not if csv is outside this ... maybe we can provide a way for data sources to register short names.
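To illustrate the point being discussed: in the 1.4-era API the external CSV reader has to be named by its fully qualified class, while later Spark versions did add a way for data sources to register short names, so a plain "csv" resolves automatically. A sketch (assumes `sqlContext` and `flightsCsvPath` from the snippet above exist, and spark-csv is on the classpath):

```r
# Spark 1.4: the external CSV reader must be named by its full class
df <- read.df(sqlContext, flightsCsvPath,
              source = "com.databricks.spark.csv", header = "true")

# Later Spark versions let data sources register short names,
# so this also works there:
# df <- read.df(sqlContext, flightsCsvPath, source = "csv", header = "true")
```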
Thanks @Emaasit for the update. I just had a few more things that I ran into while executing the example. Also you can verify some of these things by just running the example on your machine -- I just used a command of the form to check things
Test build #34619 has finished for PR 6668 at commit
@shivaram Ok. Got you.
LGTM. Thanks @Emaasit for this PR. There are some outstanding comments, but I'll fix them during the merge.
Thanks @shivaram.
Here are more examples on SparkR DataFrames including creating a Spark context and a SQL context, loading data and simple data manipulation.